.. _Generating_long:

Generating a Longitudinal Dataset
***********************************

.. |wp0101 wp| raw:: html

   "wp0101"

.. |wp0101 wp2| raw:: latex

   \href{https://paneldata.org/soep-core/data/wp/wp0101}{\textbf{"wp0101"}}

.. |xp0101 xp| raw:: html

   "xp0101"

.. |xp0101 xp2| raw:: latex

   \href{https://paneldata.org/soep-core/data/xp/xp0101}{\textbf{"xp0101"}}

.. |yp0101 yp| raw:: html

   "yp0101"

.. |yp0101 yp2| raw:: latex

   \href{https://paneldata.org/soep-core/data/yp/yp0101}{\textbf{"yp0101"}}

.. |wp9301 wp| raw:: html

   "wp9301"

.. |wp9301 wp2| raw:: latex

   \href{https://paneldata.org/soep-core/data/wp/wp9301}{\textbf{"wp9301"}}

.. |yp10601 yp| raw:: html

   "yp10601"

.. |yp10601 yp2| raw:: latex

   \href{https://paneldata.org/soep-core/data/yp/yp10601}{\textbf{"yp10601"}}

.. |emplst06 wpgen| raw:: html

   "emplst06"

.. |emplst06 wpgen2| raw:: latex

   \href{https://paneldata.org/soep-core/data/wpgen/emplst06}{\textbf{"emplst06"}}

.. |emplst07 xpgen| raw:: html

   "emplst07"

.. |emplst07 xpgen2| raw:: latex

   \href{https://paneldata.org/soep-core/data/xpgen/emplst07}{\textbf{"emplst07"}}

.. |emplst08 ypgen| raw:: html

   "emplst08"

.. |emplst08 ypgen2| raw:: latex

   \href{https://paneldata.org/soep-core/data/ypgen/emplst08}{\textbf{"emplst08"}}

.. |hinc06 whgen| raw:: html

   "hinc06"

.. |hinc06 whgen2| raw:: latex

   \href{https://paneldata.org/soep-core/data/whgen/hinc06}{\textbf{"hinc06"}}

.. |hinc07 xhgen| raw:: html

   "hinc07"

.. |hinc07 xhgen2| raw:: latex

   \href{https://paneldata.org/soep-core/data/xhgen/hinc07}{\textbf{"hinc07"}}

.. |hinc08 yhgen| raw:: html

   "hinc08"

.. |hinc08 yhgen2| raw:: latex

   \href{https://paneldata.org/soep-core/data/yhgen/hinc08}{\textbf{"hinc08"}}

.. |wphrf phrf| raw:: html

   "wphrf"

.. |wphrf phrf2| raw:: latex

   \href{https://paneldata.org/soep-core/data/phrf/wphrf}{\textbf{"wphrf"}}

.. |xphrf phrf| raw:: html

   "xphrf"

.. |xphrf phrf2| raw:: latex

   \href{https://paneldata.org/soep-core/data/phrf/xphrf}{\textbf{"xphrf"}}

.. |yphrf phrf| raw:: html

   "yphrf"

.. |yphrf phrf2| raw:: latex

   \href{https://paneldata.org/soep-core/data/phrf/yphrf}{\textbf{"yphrf"}}

.. |whhnr ppfad| raw:: html

   "hid_2006"

.. |whhnr ppfad2| raw:: latex

   \href{https://paneldata.org/soep-core/data/ppfad/hid_2006}{\textbf{"hid_2006"}}

.. |xhhnr ppfad| raw:: html

   "hid_2007"

.. |xhhnr ppfad2| raw:: latex

   \href{https://paneldata.org/soep-core/data/ppfad/hid_2007}{\textbf{"hid_2007"}}

.. |yhhnr ppfad| raw:: html

   "hid_2008"

.. |yhhnr ppfad2| raw:: latex

   \href{https://paneldata.org/soep-core/data/ppfad/hid_2008}{\textbf{"hid_2008"}}

.. |psample ppfad| raw:: html

   "psample"

.. |psample ppfad2| raw:: latex

   \href{https://paneldata.org/soep-core/data/ppfad/psample}{\textbf{"psample"}}

.. |wpop ppfad| raw:: html

   "wpop"

.. |wpop ppfad2| raw:: latex

   \href{https://paneldata.org/soep-core/data/ppfad/wpop}{\textbf{"wpop"}}

.. |xpop ppfad| raw:: html

   "xpop"

.. |xpop ppfad2| raw:: latex

   \href{https://paneldata.org/soep-core/data/ppfad/xpop}{\textbf{"xpop"}}

.. |ypop ppfad| raw:: html

   "ypop"

.. |ypop ppfad2| raw:: latex

   \href{https://paneldata.org/soep-core/data/ppfad/ypop}{\textbf{"ypop"}}

.. |persnr ppfad| raw:: html

   "pid"

.. |persnr ppfad2| raw:: latex

   \href{https://paneldata.org/soep-core/data/ppfad/pid}{\textbf{"pid"}}

.. |hhnr ppfad| raw:: html

   "cid"

.. |hhnr ppfad2| raw:: latex

   \href{https://paneldata.org/soep-core/data/ppfad/cid}{\textbf{"cid"}}

.. |sex ppfad| raw:: html

   "sex"

.. |sex ppfad2| raw:: latex

   \href{https://paneldata.org/soep-core/data/ppfad/sex}{\textbf{"sex"}}

.. |gebjahr ppfad| raw:: html

   "gebjahr"

.. |gebjahr ppfad2| raw:: latex

   \href{https://paneldata.org/soep-core/data/ppfad/gebjahr}{\textbf{"gebjahr"}}

This example focuses on generating a dataset to analyze determinants of health satisfaction. You can either use the syntax generator in paneldata.org or write a syntax file yourself. You can search for variable names in Paneldata.org.

In the previous examples, you created an exercise path with four subfolders as well as corresponding globals in the Stata do-file. You can use the same folders and globals for this exercise.

**Create an exercise path with four subfolders:**

.. figure:: png/uebungspfade.png
   :align: center

**Example:**

- H:/material/exercises/do
- H:/material/exercises/output
- H:/material/exercises/temp
- H:/material/exercises/log

**1. Generate an unbalanced panel dataset for the years 2006 to 2008, using paneldata.org if you wish. The dataset should contain all respondents in private households:**

The dataset should contain the following variables of interest:

- health satisfaction |wp0101 wp| |wp0101 wp2| |xp0101 xp| |xp0101 xp2| |yp0101 yp| |yp0101 yp2|
- currently smoking yes/no |wp9301 wp| |wp9301 wp2| |yp10601 yp| |yp10601 yp2|
- current employment status |emplst06 wpgen| |emplst06 wpgen2| |emplst07 xpgen| |emplst07 xpgen2| |emplst08 ypgen| |emplst08 ypgen2|
- monthly household net income |hinc06 whgen| |hinc06 whgen2| |hinc07 xhgen| |hinc07 xhgen2| |hinc08 yhgen| |hinc08 yhgen2|

In addition, the dataset should include the following information for the analysis of the years 2006 to 2008:

- cross-sectional weighting factors for all relevant years |wphrf phrf| |wphrf phrf2| |xphrf phrf| |xphrf phrf2| |yphrf phrf| |yphrf phrf2|
- individual identifier |persnr ppfad| |persnr ppfad2|
- original household number |hhnr ppfad| |hhnr ppfad2|
- household number for all relevant years |whhnr ppfad| |whhnr ppfad2| |xhhnr ppfad| |xhhnr ppfad2| |yhhnr ppfad| |yhhnr ppfad2|
- sample membership |psample ppfad| |psample ppfad2|
- sex |sex ppfad| |sex ppfad2|
- year of birth |gebjahr ppfad| |gebjahr ppfad2|
- population membership |wpop ppfad| |wpop ppfad2| |xpop ppfad| |xpop ppfad2| |ypop ppfad| |ypop ppfad2|

If you need detailed instructions on how the script generator works in paneldata.org, you can find them in the chapter :ref:`syntax`.
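
Before assembling the dataset by hand, it can help to declare the folder globals from the earlier exercises at the top of the do-file. The following is a minimal sketch only: the global names ``ex_do``, ``ex_output``, ``ex_temp``, and ``ex_log`` and the log file name are hypothetical and are not taken from the exercise do-file. The paths are the example paths listed above.

.. code-block:: stata

   * Hypothetical global names; adjust the paths to your own folder structure.
   global ex_do     "H:/material/exercises/do"
   global ex_output "H:/material/exercises/output"
   global ex_temp   "H:/material/exercises/temp"
   global ex_log    "H:/material/exercises/log"

   * Example use: close any open log, then write a new one to the log folder.
   capture log close
   log using "${ex_log}/longitudinal.log", replace text

Referring to the folders through globals means that only these four lines need to change if the exercise is moved to another machine.
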

If you would like to assemble your dataset yourself, you can do so using the datasets available to you. From the previous exercise with tracking data, you may already have an idea where to get most of the variables. Since we want an unbalanced panel, the $netto variables (survey status) for the years 2006 to 2008 are also needed. In addition, the analysis must be restricted by population membership ($pop), as we are only interested in respondents in private households.

**1.1. Create a Master File**

Use ppfad as the source file together with the required variables that you may have already found on paneldata.org or identified from the variable label in the dataset. Note that only variables from the years to be analyzed should be used.

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 16-19

Since we want an unbalanced dataset, i.e., individuals who completed an individual questionnaire at least once in the three years 2006 to 2008, you must restrict the variable $netto (survey status). We also want to analyze only private households, so a further restriction on the $pop (population membership) variable is needed.

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 21-39

What is still missing are the cross-sectional weighting factors and the variables of interest for the analysis.

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 42-47

Now we come to the content variables. Rather than clicking through every dataset in the data release, we recommend searching paneldata.org for the label of the variable of interest. Use the filter to narrow your search. Select our main study SOEP-Core, the search type "variable", the conceptual dataset "Original (raw folder)", the analysis unit, and the corresponding year. Once you have clicked on the year of interest, a variable history is displayed. You can use this to see in which years the variable was collected and what it is called in each year.

Example: Variable Label "Satisfaction Health"

.. figure:: png/satisfaction_health.png
   :align: center

Example: Variable Label "currently smoking yes/no"

.. figure:: png/currently_smoke.png
   :align: center

Example: Variable Label "current employment status"

.. figure:: png/employment_status.png
   :align: center

Example: Variable Label "monthly net household income"

.. figure:: png/household_income.png
   :align: center

To merge the data, you can either use the script generator on paneldata.org or write the syntax manually in a do-file. We now have all the information we need to create a master file.

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 45-93

With the help of a unique identifier, which is either the household identifier (hid_$) or the individual identifier (pid), you can now merge all datasets or individual variables to ppfad. Which identifier to use depends on the unit of analysis. Since we are working at the individual level, our identifier is pid (individual identifier). We load the dataset ppfad and merge our datasets or variables to it.

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 96-116

**2. Recode missing values to system missing values (Stata)!**

After the master file has been created with all required information, the SOEP missing codes, which range from -1 to -8, must be recoded to system missing values. This step is important for converting a wide-format dataset to long format.

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 128

**3. The dataset is in "wide" format, i.e., additional years are displayed as additional variables (columns). For many analyses, it makes sense to convert datasets into the "long" format. In long format, additional years are displayed as additional rows. If the dataset covers three years, as in this example, there are three rows for each person.
Convert the dataset to long format using the Stata command reshape!**

Since these are cross-sectional variables, it can be assumed that each variable name carries at least one wave abbreviation, which makes the variable unique. This in turn means that the variables must be renamed before the reshape command is run. Before renaming the original variables (e.g., from the $P datasets), check whether the question and the answer categories were the same in all years (you can look up the exact wording of the question in the corresponding questionnaire). If they changed, the variables may have to be recoded.

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 136-139

How you rename the variables is largely up to you. However, you should ensure that the name stays consistent over time and that the variables differ only by year (variable name + four-digit year suffix, e.g., zufr2006, zufr2007, zufr2008). You can rename the variables either manually, line by line, or, for advanced users, with a loop.

Example of manual renaming:

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 141-148

Example of a loop:

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 151-163

**3.1. The reshape command**

Now that all preparations have been made, you can convert the dataset. A dataset can be converted in either direction:

.. figure:: png/aufgabe_3_reshape.png
   :align: center

In our case, we reshape from wide to long. This means that a new variable name must be assigned for the year of the survey (j). The variable is then generated automatically. Currently, each person occupies one row in Stata.

.. csv-table::
   :header-rows: 1
   :file: csv/reshape-wide.csv

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 167-170

After the reshape command, you have one row per year for each person:

.. csv-table::
   :header-rows: 1
   :file: csv/reshape-long.csv

**4.
Perform analyses based on the data. Try to answer the following questions:**

**a. Has men's and women's average satisfaction with health changed over the three years?**

Satisfaction with health was measured on a scale from 0 to 10, with a value of 10 representing the highest possible level of satisfaction. To compare the average satisfaction with health between women and men, display the mean value for both sexes. The means shown here are weighted.

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 183

.. figure:: png/mean_health.png
   :align: center

The output shows the average values for men and women for all three years. The first three values show men's average satisfaction with health between 2006 and 2008, while the last three values show women's average satisfaction with health.

**b. What is the proportion of people for whom health satisfaction has increased from 2006 to 2007?**

To answer this question, compute the difference between 2006 and 2007. Make sure that the difference is calculated only within the same persnr (individual identifier) and only against the satisfaction value of the following year.

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 191-193

.. figure:: png/compare_health_unweighted.png
   :align: center

Since you have already added the SOEP weighting factors to the dataset, you should apply them to obtain a representative analysis.

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 195

.. figure:: png/compare_health_weighted.png
   :align: center

Values below 0 indicate a deterioration in health satisfaction. The value 0 means constant health satisfaction, and all values above 0 indicate a positive change in satisfaction with health. With a value of 10, it can be assumed that these people were interviewed for the first time in 2007 or 2008.

**c.
In what direction and how much has satisfaction with health changed from 2006 to 2008 among people who quit smoking after 2006?**

The procedure is similar to the previous question, except that the element "smoke yes/no" is added.

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 209-217

.. figure:: png/smoke_vs_health.png
   :align: center

To obtain a weighted mean value, specify the analysis weight after the generated variable.

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 218

.. figure:: png/smoke_vs_health_weight.png
   :align: center

This illustration shows the mean of the health variable conditional on the quit variable that we generated beforehand. With a mean of -0.24 (weighted -0.35), the biggest change in health satisfaction is seen in people who quit smoking after 2006. For example, a person who smoked in 2006 and indicated a satisfaction value of 8 would, on average, indicate a value of 7.76 in 2008 after having stopped smoking. So you could assume that when a person stops smoking, their perceived state of health deteriorates. Now we have to test whether this assumption is correct.

**d. Does quitting smoking make your health worse? To what extent could the result of the analysis "stop smoking" be distorted?**

To examine the connection between health satisfaction and stopping smoking, one should use the t-test or, to be more specific, the one-sample t-test. It checks whether the mean value of a sample deviates significantly from a known expected value (specified in the null hypothesis).

.. literalinclude:: docs/Längsschnittdata_Uebung.do
   :linenos:
   :lines: 241-242

.. figure:: png/ttest.png
   :align: center

*Null hypothesis (H0): Stopping smoking has no effect on health.*

For this test we use a 95% confidence level. What we want to check now is whether the null hypothesis can be rejected. If you look at the output of the test, you first see the mean for the value 1 (quit smoking) of the variable quit.
The last line of the output shows the significance (p-value). If it falls below 0.05, one can speak of a statistically significant result. In our example, the null hypothesis can be rejected because the p-value is less than 0.05. So quitting smoking has a significant impact on a person's perceived health.

Last change: |today|